Abstract

We introduce a doubly stochastic marked point process model for supervised classification prob- 
lems. Regardless of the number of classes or the dimension of the feature space, the model requires 
only 2-3 parameters for the covariance function. The classification criterion involves a permanental 
ratio for which an approximation using a polynomial-time cyclic expansion is proposed. The ap- 
proximation is effective even if the feature region occupied by one class is a patchwork interlaced 
with regions occupied by other classes. An application to DNA microarray analysis indicates that 
the cyclic approximation is effective even for high-dimensional data. It can employ feature vari- 
ables in an efficient way to reduce the prediction error significantly. This is critical when the true 
classification relies on non-reducible high-dimensional features. 

Keywords: Cyclic approximation; DNA microarray analysis; High-dimensional data; Supervised 
classification; Weighted permanental ratio. 



1 Introduction 



In a typical supervised or unsupervised classification problem, each observation can be treated as a
single point in the feature space $\mathcal{X}$. The data set is a finite point configuration
$x = \{x_1, \ldots, x_n\}$ with or without class labels $y = \{y_1, \ldots, y_n\}$. A Cox process, or a
doubly stochastic Poisson process (Cox & Isham 1980; Kingman 1993; Daley & Vere-Jones 2003), provides a
rich family of spatial point processes for aggregated point patterns. Unfortunately, for most Cox
processes considered in the literature, no closed form for the distribution of $x$ is available. Markov
chain Monte Carlo methods are commonly used for computational purposes. McCullagh & Møller (2006)
introduced a special class of Cox process, the permanental process, which is fairly flexible and has a
closed form for the marginal density of $x$.



McCullagh & Yang (2006) proposed a classification model based on the permanental process. Regardless
of the number of classes or the dimension of the feature variables, the model requires only 2-3
parameters for fitting the covariance function of the random intensity. The method is effective even
when the region predominantly occupied by one class is a patchwork interlaced with regions occupied
predominantly by other classes. One problem of the permanental model is that it requires the
calculation of ratios of weighted permanents, which is an NP-hard problem (Valiant 1979).



In the computer science literature, the best approximation algorithm, proposed by Bezáková et al.
(2008), runs at an unappealing rate of $O(n^7 \log^4 n)$. Kou & McCullagh (2009) use an importance
sampling estimator to approximate weighted permanents up to a few hundred points.

We propose a different way to solve the problem. It involves a series of approximations for the
weighted permanental ratio based on its cyclic expansion. The classification based on cyclic
approximations works reasonably well for the examples studied.



2 Classification Model Based on Permanental Process 



2.1 Permanental process 



Following McCullagh & Møller (2006), the permanental process on the feature space $\mathcal{X}$ is a
Cox process with random intensity function

$$\Lambda(x) = \sum_{r=1}^{2\alpha} Z_r^2(x),$$

where $Z_1, \ldots, Z_{2\alpha}$ are independent and identically distributed Gaussian random fields with
mean zero and covariance function $C/2$. For many applications, $\mathcal{X} = \mathbb{R}^d$ or
$\mathcal{X} \subset \mathbb{R}^d$.

Typically, a spatial pattern consisting of $n$ points $\{x_1, \ldots, x_n\}$ is observed within a
compact subset $S$, or a bounded window, in $\mathcal{X}$. If $C$ is continuous on $S \times S$, it has
the spectral representation

$$C(x_i, x_j) = \sum_{r=0}^{\infty} \lambda_r\, e_r(x_i)\, e_r(x_j),$$

where $\lambda_r$ and $e_r$ are the eigenvalues and the normalized eigenfunctions of $C$ on $S$,
respectively. Define a new covariance function on $S$ by

$$K(x_i, x_j) = \sum_{r=0}^{\infty} \frac{\lambda_r}{1 + \lambda_r}\, e_r(x_i)\, e_r(x_j).$$

We call $K$ the covariance function of the permanental process on $S \times S$. Note that
$K \approx C$ if all eigenvalues are close to 0.

The marginal density (McCullagh & Møller 2006, Section 3.2) of the permanental process with respect to
Lebesgue measure at $x = \{x_1, \ldots, x_n\}$ is

$$f(x) = e^{-\alpha D}\, \mathrm{per}_\alpha\{K(x)\}, \tag{1}$$

where $D = \sum_{r=0}^{\infty} \log(1 + \lambda_r)$, and

$$\mathrm{per}_\alpha\{K(x)\} = \sum_{\sigma} \alpha^{\#\sigma} K(x_1, x_{\sigma(1)}) \cdots K(x_n, x_{\sigma(n)}) \tag{2}$$

is the $\alpha$-permanent of the $n \times n$ matrix $K(x)$ with components $K(x_i, x_j)$ (Vere-Jones
1988). Here the sum runs over all permutations $\sigma$ of $(1, \ldots, n)$ and $\#\sigma$ indicates the
number of cycles of $\sigma$. The usual permanent (Minc 1978) corresponds to $\alpha = 1$, and
$\mathrm{per}_{-1}(A) = (-1)^n \det(A)$.

For general positive definite $K$, the permanental process is defined only for positive integer values
of $2\alpha$ (Brändén 2012), but if $K(x_i, x_j)$ is everywhere non-negative, the process can be
extended to positive $\alpha$.

Unlike general Cox processes, the permanental process has its density function in explicit form (1).
The flexibility in choosing $\alpha$ and $K$ makes the permanental process potentially useful for
applied work.
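
As an illustration of definition (2), the following is a minimal Python sketch that evaluates the $\alpha$-permanent by brute-force enumeration of permutations. The function names `count_cycles` and `alpha_permanent` are illustrative choices, not taken from the cited references, and the enumeration visits all $n!$ permutations, so it is only suitable for checking small cases such as the identities for $\alpha = 1$ and $\alpha = -1$.

```python
from itertools import permutations


def count_cycles(perm):
    """Number of cycles of a permutation given as a tuple (i maps to perm[i])."""
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles


def alpha_permanent(K, alpha):
    """Sum over permutations of alpha^{#cycles} * prod_i K[i][sigma(i)]."""
    n = len(K)
    total = 0.0
    for perm in permutations(range(n)):
        term = alpha ** count_cycles(perm)
        for i in range(n):
            term *= K[i][perm[i]]
        total += term
    return total


A = [[2.0, 1.0], [1.0, 3.0]]
print(alpha_permanent(A, 1.0))   # ordinary permanent: 2*3 + 1*1 = 7
print(alpha_permanent(A, -1.0))  # (-1)^2 * det(A) = 5
```

For the empty matrix the single empty permutation contributes 1, which agrees with the convention that the permanent of the empty configuration equals 1.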
2.2 Classification model with finitely many classes



For a supervised classification problem with finitely many classes, the observations
$x_1, \ldots, x_n$ come from $k$ possible classes. Assume that the observations in class $r$ follow a
permanental process with parameter $\alpha_r$ and covariance function $K$ as in (1). The superposition
of $k$ independent permanental processes with the same $K$ is a permanental process with parameter
$\alpha_\cdot = \sum_{r=1}^{k} \alpha_r$ and the same covariance function $K$.



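McCullagh & Yang (2006) show that the conditional distribution of the label vector $y$ given the
feature observations $x$ is

$$\mathrm{pr}(y \mid x) = \frac{\prod_{r=1}^{k} \mathrm{per}_{\alpha_r}\{K(x^{(r)})\}}{\mathrm{per}_{\alpha_\cdot}\{K(x)\}}, \tag{3}$$

where $x^{(r)}$ denotes the observations belonging to class $r$ and $\mathrm{per}_{\alpha_r}\{K(x)\}$
is defined in (2). Note that $\mathrm{per}_{\alpha_r}\{K(\emptyset)\} = 1$ for the empty set
$\emptyset$.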

For a supervised classification with known label vector $y$, the goal is to classify a new unit $u'$
with observed feature vector $x'$ into one of the $k$ classes. Since the conditional distribution (3)
applies to the extended sample, the conditional distribution is given by the theorem as follows.

Theorem 2.1 Given $x$ and $y$, the conditional probability that a new unit $u'$ with observed feature
$x'$ belongs to class $r$ is

$$\mathrm{pr}(u' \mapsto r \mid x, x', y) \propto \frac{\mathrm{per}_{\alpha_r}\{K(x^{(r)} \cup x')\}}{\mathrm{per}_{\alpha_r}\{K(x^{(r)})\}}. \tag{4}$$

If $x^{(r)} = \emptyset$, that is, no observation from class $r$ has yet been observed, then the
probability is proportional to $\alpha_r K(x', x')$.
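
The rule in Theorem 2.1 can be sketched in a few lines of Python for very small samples, where the permanents can still be enumerated exactly. Everything named here (`gram`, `class_probabilities`, the Gaussian kernel and the toy one-dimensional data) is an assumption made for illustration and is not part of the original model specification.

```python
import math
from itertools import permutations


def alpha_permanent(K, alpha):
    """Brute-force alpha-permanent (n! terms); small matrices only."""
    n, total = len(K), 0.0
    for perm in permutations(range(n)):
        seen, cycles = set(), 0
        for s in range(n):
            if s not in seen:
                cycles += 1
                while s not in seen:
                    seen.add(s)
                    s = perm[s]
        term = alpha ** cycles
        for i in range(n):
            term *= K[i][perm[i]]
        total += term
    return total


def gram(points, kernel):
    """Matrix K(x) with entries kernel(x_i, x_j)."""
    return [[kernel(a, b) for b in points] for a in points]


def class_probabilities(x_new, classes, alphas, kernel):
    """Normalised version of expression (4); classes[r] lists the points of class r."""
    weights = []
    for pts, a in zip(classes, alphas):
        num = alpha_permanent(gram(pts + [x_new], kernel), a)
        den = alpha_permanent(gram(pts, kernel), a)  # equals 1 for an empty class
        weights.append(num / den)
    total = sum(weights)
    return [w / total for w in weights]


gauss = lambda a, b: math.exp(-(a - b) ** 2)  # illustrative kernel choice
classes = [[-1.0, -0.5], [0.5, 1.0]]
print(class_probabilities(-0.8, classes, [1.0, 1.0], gauss))
```

An empty class is handled without a special case: the denominator is the permanent of the empty matrix, 1, and the numerator reduces to $\alpha_r K(x', x')$.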



2.3 Classification model with infinitely many classes 

For many classification applications, for example, to identify species of animal or type of cancer, it
is not appropriate to assume a finite number of classes in the population. We may consider the limit of
(3) as $k \to \infty$, where $\alpha_r = \alpha = \lambda/k$ for all $r$, and
$\alpha_\cdot = k\alpha = \lambda > 0$ is fixed. Fixing the number of observations $n$, the limit
distribution for the unlabelled partition $B$ of $\{1, \ldots, n\}$ is

$$\mathrm{pr}(B \mid x, \lambda) = \frac{\lambda^{\#B} \prod_{b \in B} \mathrm{cyp}\{K(x^{(b)})\}}{\mathrm{per}_\lambda\{K(x)\}}, \tag{5}$$

where $\#B$ is the number of blocks of $B$, $x^{(b)} = \{x_i \mid i \in b\}$ is the set of observations
belonging to block $b$, and

$$\mathrm{cyp}\{K(x)\} = \lim_{\alpha \to 0} \alpha^{-1}\, \mathrm{per}_\alpha\{K(x)\} = \sum_{\sigma:\, \#\sigma = 1} K(x_1, x_{\sigma(1)}) \cdots K(x_n, x_{\sigma(n)})$$

is the sum of cyclic products. The product in (5) runs over all blocks of $B$. For example, if
$B = \{\{1, 3\}, \{2\}, \{4, 5\}\}$ is a partition of $\{1, 2, 3, 4, 5\}$, then the blocks of $B$ are
$\{1, 3\}$, $\{2\}$, and $\{4, 5\}$, and the number of blocks is $\#B = 3$. By (5) and the properties of
conditional probability, we have
Theorem 2.2 Suppose there are infinitely many classes. Given $B$, $x$, $\lambda$, the conditional
probability of assigning a new unit $u'$ with feature $x'$ to block $b \in B$ is

$$\mathrm{pr}(u' \mapsto b \mid x, x', B, \lambda) \propto \frac{\mathrm{cyp}\{K(x^{(b)} \cup x')\}}{\mathrm{cyp}\{K(x^{(b)})\}}. \tag{6}$$

The conditional probability of assigning $u'$ to a new class $b = \emptyset$ is proportional to
$\lambda K(x', x')$.

If $K$ is constant on $\mathcal{X}$, equation (5) reduces to the Ewens sampling distribution (Ewens
1972; Pitman 2006), and expression (6) reduces to the seating plan of a Chinese restaurant process
(Aldous 1985; Pitman 2006).
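
The reduction to the Chinese restaurant process can be checked numerically. The sketch below, with the illustrative names `cyp` and `block_weights`, enumerates the single-cycle permutations directly and evaluates the unnormalised weights of Theorem 2.2 for a constant kernel $K \equiv c$; under that assumption each existing block should receive weight $c\,|b|$ and a new block weight $\lambda c$.

```python
from itertools import permutations


def cyp(K):
    """Sum of cyclic products over permutations that form a single n-cycle (n >= 1)."""
    n = len(K)
    total = 0.0
    # Every n-cycle can be written uniquely as (0, p1, ..., p_{n-1}).
    for rest in permutations(range(1, n)):
        order = (0,) + rest
        prod = 1.0
        for a, b in zip(order, order[1:] + (0,)):
            prod *= K[a][b]
        total += prod
    return total


def block_weights(block_sizes, c, lam):
    """Unnormalised assignment weights for a constant kernel K = c.

    The last entry is the weight of opening a new block.
    """
    weights = []
    for m in block_sizes:
        num = cyp([[c] * (m + 1) for _ in range(m + 1)])
        den = cyp([[c] * m for _ in range(m)])
        weights.append(num / den)
    weights.append(lam * c)
    return weights


print(block_weights([3, 1, 2], c=2.0, lam=0.5))  # proportional to (3, 1, 2, 0.5)
```

Dividing the weights by their sum gives probabilities proportional to the block sizes and to $\lambda$, which is the usual seating rule.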



3 Cyclic Approximations for Permanental Ratio 

3.1 Approximations based on cyclic expansion 

To apply the permanental classification model, we need to calculate the ratio

$$R_n(t; x) = \frac{\mathrm{per}_\alpha\{K(x \cup t)\}}{\mathrm{per}_\alpha\{K(x)\}} \tag{7}$$

or to calculate the cyclic ratio

$$C_n(t; x) = \frac{\mathrm{cyp}\{K(x \cup t)\}}{\mathrm{cyp}\{K(x)\}} = \lim_{\alpha \to 0} R_n(t; x) \tag{8}$$

for each labelled class or unlabelled block. An efficient algorithm is critical. We propose analytic
approximations to the permanental ratio for classification applications.

The $\alpha$-permanent of the matrix $K(\{t, x_1, \ldots, x_n\})$ is a sum over $(n + 1)!$ terms. In a
subset consisting of $n!$ terms, the index $t$ occurs in a cycle of length 1, giving rise to the
partial sum

$$\alpha K(t, t)\, \mathrm{per}_\alpha\{K(x)\}.$$

The index $t$ may also occur in a cycle of length 2 such as $(t, x_1)$ or $(t, x_2)$ and so on. There
are $n!$ permutations in which $t$ occurs in a 2-cycle, giving rise to the additional sum

$$\alpha \sum_{i=1}^{n} K(t, x_i) K(x_i, t)\, \mathrm{per}_\alpha\{K(x_{-i})\},$$

where $x_{-i}$ is the set of $n - 1$ points with the $i$th element removed. Similarly, the index $t$
may occur in a 3-cycle such as $(t, x_i, x_j)$ or $(t, x_j, x_i)$, giving rise to the sum

$$\alpha \sum_{i \neq j} K(t, x_i) K(x_i, x_j) K(x_j, t)\, \mathrm{per}_\alpha\{K(x_{-i-j})\}.$$

In the cyclic expansion of the permanent of order $n + 1$, there are $n!$ terms in which $t$ occurs in
a 1-cycle, $n!$ terms in which $t$ occurs in a 2-cycle, $n!$ terms in which $t$ occurs in a 3-cycle, and
so on up to cycles of length $n + 1$. Therefore, writing $|K(t, x_i)|^2 = K(t, x_i) K(x_i, t)$, we
obtain the following finite expansion by cycles for (7):

$$R_n(t; x) = \alpha K(t, t) + \alpha \sum_{i} \frac{1}{R_{n-1}(x_i; x_{-i})} \Bigg[ |K(t, x_i)|^2 + \sum_{j \neq i} \frac{1}{R_{n-2}(x_j; x_{-i-j})} \bigg\{ K(t, x_i) K(x_i, x_j) K(x_j, t) + \sum_{k \neq i, j} \frac{1}{R_{n-3}(x_k; x_{-i-j-k})} \Big( K(t, x_i) K(x_i, x_j) K(x_j, x_k) K(x_k, t) + \cdots \Big) \bigg\} \Bigg].$$



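This cyclic expansion suggests a recursive approximation in which

$$R_n^{(0)}(t; x) = \alpha K(t, t)$$

is the uni-cycle approximation for $n \geq 0$;

$$R_n^{(1)}(t; x) = \alpha K(t, t) + \alpha \sum_{i} \frac{|K(t, x_i)|^2}{R_{n-1}^{(0)}(x_i; x_{-i})} = \alpha K(t, t) + \sum_{i} \frac{|K(t, x_i)|^2}{K(x_i, x_i)}$$

is the two-cycle approximation for $n \geq 1$;

$$R_n^{(2)}(t; x) = \alpha K(t, t) + \alpha \sum_{i} \frac{1}{R_{n-1}^{(1)}(x_i; x_{-i})} \Bigg[ |K(t, x_i)|^2 + \sum_{j \neq i} \frac{K(t, x_i) K(x_i, x_j) K(x_j, t)}{R_{n-2}^{(0)}(x_j; x_{-i-j})} \Bigg]$$

$$\phantom{R_n^{(2)}(t; x)} = \alpha K(t, t) + \alpha \sum_{i} \frac{1}{R_{n-1}^{(1)}(x_i; x_{-i})} \Bigg[ |K(t, x_i)|^2 + \sum_{j \neq i} \frac{K(t, x_i) K(x_i, x_j) K(x_j, t)}{\alpha K(x_j, x_j)} \Bigg]$$

is the three-cycle approximation for $n \geq 2$, and so on. The four-cycle approximation
$R_n^{(3)}(t; x)$ for $n \geq 3$ is

$$\alpha K(t, t) + \alpha \sum_{i} \frac{1}{R_{n-1}^{(2)}(x_i; x_{-i})} \Bigg[ |K(t, x_i)|^2 + \sum_{j \neq i} \frac{1}{R_{n-2}^{(1)}(x_j; x_{-i-j})} \bigg\{ K(t, x_i) K(x_i, x_j) K(x_j, t) + \sum_{k \neq i, j} \frac{K(t, x_i) K(x_i, x_j) K(x_j, x_k) K(x_k, t)}{R_{n-3}^{(0)}(x_k; x_{-i-j-k})} \bigg\} \Bigg].$$

It is natural to let $C_n^{(k)}(t; x) = \lim_{\alpha \to 0+} R_n^{(k)}(t; x)$ be the $(k + 1)$-cycle
approximation for $C_n(t; x)$. The two-cycle approximation $R_n^{(1)}(t; x)$ or $C_n^{(1)}(t; x)$ is a
kernel function, which is an additive function of $x$, while the three-cycle approximation is not.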



For $n = 0$ or $x = \emptyset$, $R_n^{(0)}(t; x) = \alpha K(t, t) = R_n(t; x)$ is exact. For $n = 1$,

$$R_1^{(1)}(t; x) = \alpha K(t, t) + \frac{\alpha\, |K(t, x_1)|^2}{R_0^{(0)}(x_1; x_{-1})} = R_1(t; x).$$

In both cases, $C_n^{(n)}(t; x) = C_n(t; x)$. By induction, we obtain in general

Theorem 3.1 For $n = 0, 1, 2, \ldots$, $R_n^{(n)}(t; x) = R_n(t; x)$, and
$C_n^{(n)}(t; x) = C_n(t; x)$.
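
The recursion behind the cyclic approximations can be transcribed directly into Python. The sketch below is a naive reading of the definitions, not an efficient implementation: the names `cyclic_ratio`, `tail` and `exact_ratio` are illustrative, the Gaussian kernel and the sample points are arbitrary, and the exact reference uses brute-force permanents, so it is only meant for checking Theorem 3.1 on a handful of points.

```python
import math
from itertools import permutations


def alpha_permanent(K, alpha):
    """Brute-force alpha-permanent, used only as an exact reference."""
    n, total = len(K), 0.0
    for perm in permutations(range(n)):
        seen, cycles = set(), 0
        for s in range(n):
            if s not in seen:
                cycles += 1
                while s not in seen:
                    seen.add(s)
                    s = perm[s]
        term = alpha ** cycles
        for i in range(n):
            term *= K[i][perm[i]]
        total += term
    return total


def cyclic_ratio(t, x, alpha, kernel, k):
    """(k+1)-cycle approximation R_n^{(k)}(t; x); x is a list of points."""
    if k <= 0 or not x:
        return alpha * kernel(t, t)

    def tail(chain, last, remaining, depth):
        # chain = K(t, x_i1) ... K(., last); closing the cycle multiplies by K(last, t).
        value = chain * kernel(last, t)
        if depth > 0:
            for idx, nxt in enumerate(remaining):
                rest = remaining[:idx] + remaining[idx + 1:]
                denom = cyclic_ratio(nxt, rest, alpha, kernel, depth - 1)
                value += tail(chain * kernel(last, nxt), nxt, rest, depth - 1) / denom
        return value

    total = alpha * kernel(t, t)
    for i, xi in enumerate(x):
        rest = x[:i] + x[i + 1:]
        denom = cyclic_ratio(xi, rest, alpha, kernel, k - 1)
        total += alpha * tail(kernel(t, xi), xi, rest, k - 1) / denom
    return total


def exact_ratio(t, x, alpha, kernel):
    """Permanental ratio (7) computed from brute-force permanents."""
    pts = x + [t]
    full = [[kernel(a, b) for b in pts] for a in pts]
    base = [[kernel(a, b) for b in x] for a in x]
    return alpha_permanent(full, alpha) / alpha_permanent(base, alpha)


gauss = lambda a, b: math.exp(-(a - b) ** 2)
x, t, alpha = [0.0, 0.7, 1.5], 0.4, 0.8
for k in range(4):
    print(k, cyclic_ratio(t, x, alpha, gauss, k), exact_ratio(t, x, alpha, gauss))
```

With three sample points the value at $k = 3$ should agree with the exact ratio up to rounding, while smaller $k$ gives the uni-, two- and three-cycle approximations.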
Up to $k = 3$, that is, the four-cycle approximation, $R_n^{(k)}(t; x)$ is easy to compute, even for
fairly large values of $n$. The time complexity is $O(n)$ for the two-cycle approximation, $O(n^2)$ for
the three-cycle approximation, and $O(n^3)$ for the four-cycle approximation. For some special cases,
the cyclic approximation provides an exact value for $R_n(t; x)$.

Example 3.1 Let $K(t, t') = \delta_{tt'} f(t)$, which corresponds to diagonal matrices. Here $f$ is
some positive non-random function on $\mathcal{X}$, and $\delta_{tt'} = 1$ if $t = t'$ and 0 otherwise.
If $t, x_1, \ldots, x_n$ are pairwise different, then for each $k = 0, \ldots, n$,

$$R_n(t; x) = R_n^{(k)}(t; x) = \alpha f(t), \qquad C_n(t; x) = C_n^{(k)}(t; x) = 0.$$

Example 3.2 Let $K(t, t') = c$ for some constant $c > 0$, which corresponds to constant matrices. Then
$\mathrm{per}_\alpha\{K(x)\} = c^n \alpha (\alpha + 1) \cdots (\alpha + n - 1)$. For each
$k = 1, \ldots, n$,

$$R_n(t; x) = R_n^{(k)}(t; x) = c(\alpha + n), \qquad C_n(t; x) = C_n^{(k)}(t; x) = cn.$$

Note that $R_n^{(0)}(t; x) = c\alpha$, $C_n^{(0)}(t; x) = 0$.
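
Example 3.2 gives a convenient check for direct implementations of the low-order approximations. The following sketch writes the two-cycle and three-cycle approximations as plain loops, one pass over the sample for the former and a double loop for the latter. The function names and the constant-kernel test values are illustrative assumptions.

```python
def two_cycle(t, x, alpha, kernel):
    """R_n^{(1)}(t; x): a single pass over the sample."""
    total = alpha * kernel(t, t)
    for xi in x:
        total += kernel(t, xi) * kernel(xi, t) / kernel(xi, xi)
    return total


def three_cycle(t, x, alpha, kernel):
    """R_n^{(2)}(t; x): a double loop over ordered pairs (i, j) with j != i."""
    total = alpha * kernel(t, t)
    for i, xi in enumerate(x):
        denom = alpha * kernel(xi, xi)          # builds R_{n-1}^{(1)}(x_i; x_{-i})
        inner = kernel(t, xi) * kernel(xi, t)   # 2-cycle term
        for j, xj in enumerate(x):
            if j != i:
                kjj = kernel(xj, xj)
                denom += kernel(xi, xj) * kernel(xj, xi) / kjj
                # 3-cycle term divided by R_{n-2}^{(0)}(x_j; .) = alpha * K(x_j, x_j)
                inner += kernel(t, xi) * kernel(xi, xj) * kernel(xj, t) / (alpha * kjj)
        total += alpha * inner / denom
    return total


c, alpha = 2.0, 0.5
constant = lambda a, b: c
x = [0.1, 0.2, 0.3, 0.4]
print(two_cycle(0.0, x, alpha, constant), three_cycle(0.0, x, alpha, constant))
# both equal c * (alpha + n) = 2 * 4.5 = 9 for the constant kernel
```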

Example 3.3 Let $K$ be a projection of rank $\nu$ on $\mathcal{X}$. That is,

$$\int K(t, t)\, \mu(dt) = \nu, \qquad \int K(s, t) K(t, u)\, \mu(dt) = K(s, u).$$

Then the two-cycle approximation determines a probability density in the sense that it is non-negative
and has unit integral:

$$(n + \alpha\nu)^{-1} \int R_n^{(1)}(t; x)\, \mu(dt) = (n + \alpha\nu)^{-1} \left( \alpha\nu + \sum_{i} \frac{\int |K(t, x_i)|^2\, \mu(dt)}{K(x_i, x_i)} \right) = (n + \alpha\nu)^{-1} \left( \alpha\nu + \sum_{i} \frac{K(x_i, x_i)}{K(x_i, x_i)} \right) = 1.$$

A similar argument shows that the three-cycle and four-cycle approximations also integrate to unity,
but it is not clear whether they are non-negative.
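
The unit-integral property of Example 3.3 can be illustrated on a finite feature space with counting measure, where a projection kernel is simply a projection matrix. In the sketch below the three-point space, the two orthonormal vectors and the function names are all assumptions chosen so that the arithmetic stays transparent.

```python
def projection_kernel(vectors):
    """K = sum_v v v^T for orthonormal vectors; a projection of rank len(vectors)."""
    m = len(vectors[0])
    return [[sum(v[s] * v[t] for v in vectors) for t in range(m)] for s in range(m)]


def two_cycle_density(K, sample, alpha, nu):
    """Normalised two-cycle approximation on {0, ..., m-1} with counting measure."""
    n = len(sample)
    density = []
    for t in range(len(K)):
        r1 = alpha * K[t][t] + sum(K[t][i] * K[i][t] / K[i][i] for i in sample)
        density.append(r1 / (n + alpha * nu))
    return density


u = [1 / 3, 2 / 3, 2 / 3]
v = [2 / 3, 1 / 3, -2 / 3]          # orthogonal to u, unit length
K = projection_kernel([u, v])        # rank 2, trace 2
dens = two_cycle_density(K, sample=[0, 1, 2, 2], alpha=0.7, nu=2)
print(dens, sum(dens))               # the entries are non-negative and sum to 1 up to rounding
```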

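Theorem 3.2 Suppose $n \geq 2$. (i) If the $n \times n$ matrix $K(x)$ is diagonal, then

$$R_n(t; x) = R_n^{(2)}(t; x) = \cdots = R_n^{(n)}(t; x) = \alpha K(t, t) + \sum_{i=1}^{n} \frac{|K(t, x_i)|^2}{K(x_i, x_i)}.$$

(ii) If $K(x_i, x_j) = c$, $i, j = 1, \ldots, n$, $c \neq 0$, then for $k = 2, \ldots, n$,

$$R_n(t; x) = R_n^{(k)}(t; x) = \alpha K(t, t) + \frac{\alpha}{c(\alpha + n - 1)} \sum_{i=1}^{n} |K(t, x_i)|^2 + \frac{1}{c(\alpha + n - 1)} \sum_{i \neq j} K(t, x_i) K(x_j, t).$$

(iii) Suppose $K(x)$ is block-diagonal with constant blocks. That is, there exist a partition $B$ of
$\{1, 2, \ldots, n\}$ and some constants $c_b \neq 0$, $b \in B$, such that $K(x_i, x_j) = c_b$ if
$i, j \in b$, and 0 otherwise. Then for $k = 2, \ldots, n$,

$$R_n(t; x) = R_n^{(k)}(t; x) = \alpha K(t, t) + \sum_{b \in B} \frac{\alpha \sum_{i \in b} |K(t, x_i)|^2}{c_b(\alpha + |b| - 1)} + \sum_{\substack{b \in B \\ |b| \geq 2}} \frac{\sum_{i \neq j \in b} K(t, x_i) K(x_j, t)}{c_b(\alpha + |b| - 1)}.$$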
Figure 1: Approximations of the permanental ratio $R_n(t; x)$ (left panel) and
exact probability that $t$ belongs to class 1 (right panel) from Kou and McCullagh's
estimate (solid), four-cycle (dot-dash), three-cycle (dot), and two-cycle
(dash) approximations



Based on Theorem 3.2, the three-cycle or higher order cyclic approximation is exact if the
$n \times n$ matrix $K(x)$ is diagonal, constant, or block-diagonal with constant blocks. The
$(n + 1) \times (n + 1)$ matrix $K(t \cup x)$ need not itself be diagonal, constant, or block-diagonal.
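
The constant case of Theorem 3.2 can be verified numerically on a small example. The sketch below assumes a real symmetric kernel and writes the closed form as $\alpha K(t,t) + \{\alpha \sum_i K(t,x_i)^2 + \sum_{i \neq j} K(t,x_i) K(x_j,t)\} / \{c(\alpha + n - 1)\}$; the function name `ratio_constant_block` and the numerical values are illustrative, and the reference value comes from brute-force permanents.

```python
from itertools import permutations


def alpha_permanent(K, alpha):
    """Brute-force alpha-permanent, used only as an exact reference."""
    n, total = len(K), 0.0
    for perm in permutations(range(n)):
        seen, cycles = set(), 0
        for s in range(n):
            if s not in seen:
                cycles += 1
                while s not in seen:
                    seen.add(s)
                    s = perm[s]
        term = alpha ** cycles
        for i in range(n):
            term *= K[i][perm[i]]
        total += term
    return total


def ratio_constant_block(alpha, c, k_tt, k_t):
    """Closed form for R_n(t; x) when K(x_i, x_j) = c for all i, j.

    k_tt is K(t, t) and k_t[i] is K(t, x_i) = K(x_i, t).
    """
    n = len(k_t)
    diag = sum(v * v for v in k_t)   # sum_i K(t, x_i)^2
    off = sum(k_t) ** 2 - diag       # sum over i != j of K(t, x_i) K(x_j, t)
    return alpha * k_tt + (alpha * diag + off) / (c * (alpha + n - 1))


alpha, c, k_tt, k_t = 0.7, 2.0, 1.5, [0.5, 1.0, -0.3]
n = len(k_t)
full = [[k_tt] + k_t] + [[k_t[i]] + [c] * n for i in range(n)]  # K(t U x), t first
base = [[c] * n for _ in range(n)]                               # K(x)
exact = alpha_permanent(full, alpha) / alpha_permanent(base, alpha)
print(exact, ratio_constant_block(alpha, c, k_tt, k_t))
```

The identity is purely algebraic, so the check does not require the extended matrix to be positive definite.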



3.2 Accuracy of the cyclic approximations 

For $n < 20$, the accuracy of the approximation can be checked directly by comparison with the exact
computation. Our experience is that the three-cycle approximation is adequate in this range, and the
four-cycle approximation usually has negligible error. For larger values, say $n > 50$, the accuracy
can be checked by examining special cases in which the permanent can be calculated exactly in
reasonable time. For example, for the $\alpha$-permanent of a penta-diagonal matrix $A$, that is,
$A_{ij} = 0$ for $|i - j| > 2$, the three-cycle or higher order cyclic approximation is essentially
exact. For more general matrices, the accuracy can be gauged to some extent from an examination of the
sequence of approximations.

The left panel of Figure 1 shows the approximate values of the permanental ratio (7) for a sample of
100 $x$-values in $(-\pi, \pi)$, plotted as a function of $t$ in the same range. The 100 points are
generated from the symmetric triangular distribution on $(-\pi, \pi)$. For this example, $\alpha = 1$,
and $K(t, t') = \exp\{-(t - t')^2/\tau^2\}$ with $\tau = 1$. In the central peak, the lowest curve is
the two-cycle approximation, and the next two curves are successive approximations up to four-cycle.
The highest curve is the estimated values from the importance sampler described by Kou & McCullagh
(2009, Section 4). The shape of these relative intensity functions depends fairly strongly on the value
of $\tau$, but only slightly on $\alpha$. In all cases, the difference between the three-cycle and
four-cycle approximations is considerably smaller than the difference between the two-cycle and
three-cycle ones. For $\alpha = \tau = 1$, the four-cycle approximation is approximately 6% larger than
the three-cycle in the central peak, while the three-cycle approximation is approximately 18% larger
than the two-cycle one. On average, the relative differences between the cyclic approximations and
Kou & McCullagh's importance sampling estimate are 19% for two-cycle, 12% for three-cycle, and 10% for
four-cycle approximations, respectively.

To check the performance of our cyclic approximations for supervised classification applications,
we generate another 100 points from the symmetric triangular distribution on $(\pi, 3\pi)$, denoted by
class 2, and regard the first 100 points shown in Figure 1 as class 1. According to expression (4), we
can calculate the probability that a point with feature $t$ belongs to class 1. The right panel of
Figure 1 plots the probabilities when the permanental ratios are calculated based on the cyclic
approximations or Kou & McCullagh's importance sampler. The differences among the four approximations
are negligible. The maximum relative differences between the cyclic approximations and Kou &
McCullagh's estimate are 4.3% for two-cycle, 3.4% for three-cycle, and 3.3% for four-cycle,
respectively. If we regenerate class 2 from a symmetric triangular distribution on
$(0.5\pi, 2.5\pi)$, which overlaps with the region of class 1, the maximum relative differences can be
as large as 44% and 14% for the two-cycle and three-cycle approximations, while the four-cycle
approximation still works reasonably well with a maximum relative difference of 4.2%. The worst cases
usually occur at the boundary or in the overlapped part $(0.5\pi, \pi)$. Even for the overlapped
distributions, the corresponding maximum absolute differences between the cyclic approximations and
Kou & McCullagh's estimate are 0.045 for two-cycle, 0.018 for three-cycle, and 0.023 for four-cycle
approximations in terms of class probability.

As for computation time, it took a personal computer with a 2.8GHz CPU and 2GB RAM 1.3 seconds in
total to finish all calculations based on the two-cycle, three-cycle and four-cycle approximations, or
about 700 seconds based on Kou & McCullagh's importance sampler with sample size 20,000.
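
A rough version of this two-class experiment can be set up with the standard library alone. The sketch below is not the computation reported above: it uses only the two-cycle approximation, a fixed random seed, and illustrative function names, with the Gaussian kernel and the scale parameter written as `tau`. With equal $\alpha$ in both classes, the class-1 probability from expression (4) is the ratio for class 1 divided by the sum of the two ratios.

```python
import math
import random


def gauss_kernel(a, b, tau=1.0):
    return math.exp(-((a - b) ** 2) / tau ** 2)


def two_cycle(t, x, alpha, tau=1.0):
    """Two-cycle approximation R_n^{(1)}(t; x) for the Gaussian kernel."""
    total = alpha * gauss_kernel(t, t, tau)
    for xi in x:
        total += gauss_kernel(t, xi, tau) ** 2 / gauss_kernel(xi, xi, tau)
    return total


def prob_class1(t, x1, x2, alpha=1.0, tau=1.0):
    """Approximate class-1 probability for a new point with feature t."""
    r1 = two_cycle(t, x1, alpha, tau)
    r2 = two_cycle(t, x2, alpha, tau)
    return r1 / (r1 + r2)


rng = random.Random(0)  # fixed seed, chosen arbitrarily
# random.triangular with two arguments is symmetric about the midpoint.
class1 = [rng.triangular(-math.pi, math.pi) for _ in range(100)]
class2 = [rng.triangular(math.pi, 3 * math.pi) for _ in range(100)]
for t in (0.0, math.pi, 2 * math.pi):
    print(round(t, 3), prob_class1(t, class1, class2))
```

Replacing `two_cycle` by a higher-order approximation changes the ratios but leaves the normalisation step unchanged.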



4 A Simulated Example 

We use an artificial example to illustrate how the proposed model with cyclic approximation works for
a supervised classification problem. This example has two classes in a 3 by 3 chequer-board layout with
classes labelled as follows.

    1  2  1
    2  1  2
    1  2  1

The training dataset consists of 90 units, with 10 feature values uniformly distributed in each 1 by 1
small square, as shown in Figure 2. We assume the two-class model based on permanental processes with
$\alpha_1 = \alpha_2 = \alpha$ and covariance function $K_1(t, t') = \exp(-\|t - t'\|/\tau)$ or
$K_2(t, t') = \exp(-\|t - t'\|^2/\tau^2)$. The calculations are based on the four-cycle approximation
for the permanental ratio described in Section 3.1. The parameters $\alpha$ and $\tau$ are chosen by
10-fold cross-validation.
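
The set-up of this example can be reproduced along the following lines. The sketch generates a chequerboard training set and evaluates class probabilities with the two covariance functions; to stay short it substitutes the two-cycle approximation for the four-cycle one used in the text, and the function names, the seed and the parameter values passed in the last line are illustrative assumptions rather than the cross-validated choices.

```python
import math
import random


def true_label(point):
    """Chequerboard rule on [0, 3) x [0, 3): class 1 where row + column is even."""
    col, row = int(point[0]), int(point[1])
    return 1 if (row + col) % 2 == 0 else 2


def make_training_set(per_cell=10, seed=0):
    """Uniform points in each unit square, labelled by the square they were drawn in."""
    rng = random.Random(seed)
    data = []
    for row in range(3):
        for col in range(3):
            label = 1 if (row + col) % 2 == 0 else 2
            for _ in range(per_cell):
                data.append(((col + rng.random(), row + rng.random()), label))
    return data


def k1(s, t, tau):
    return math.exp(-math.dist(s, t) / tau)


def k2(s, t, tau):
    return math.exp(-math.dist(s, t) ** 2 / tau ** 2)


def prob_class1(t, data, kernel, alpha, tau):
    """Class-1 probability with two-cycle ratios in place of exact permanental ratios."""
    ratio = {1: alpha * kernel(t, t, tau), 2: alpha * kernel(t, t, tau)}
    for p, label in data:
        ratio[label] += kernel(t, p, tau) ** 2 / kernel(p, p, tau)
    return ratio[1] / (ratio[1] + ratio[2])


data = make_training_set()
print(len(data), prob_class1((0.5, 0.5), data, k2, alpha=1.0, tau=0.5))
```

Evaluating `prob_class1` on a 60 by 60 grid and comparing the thresholded result with `true_label` gives a test error count of the kind reported for this example.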

The left and middle panels of Figure 2 provide the contour plots of the probability that a new point
is assigned to class 1. For the parameter values chosen, the range of predictive probabilities depends,
to a moderate extent, on the configuration of $x$-values in the training sample, but the extremes are
seldom below 0.1 or above 0.9 for a configuration of 90 points with 10 in each small square. The range
of predictive probabilities decreases as $\alpha$ increases, but the 50% contour line, that is, the
solid line in Figure 2, is little affected, so the classification is fairly stable. The class
boundaries based on $K_1$ and $K_2$ are slightly different. The boundary based on $K_1$ is more
sensitive to the boundary points. In practice, one may use cross-validation to choose the optimal type
of covariance function from several candidates. In this case, $K_1$ works slightly better according to
error rate and cross-entropy loss.

The right panel of Figure 2 compares the class boundaries generated by the proposed model with
$K_1$, a neural network method using a single layer with 18 hidden units and weight decay 0.001 chosen
by cross-validation, and a support vector machine using a Gaussian kernel with tuning parameter chosen
by cross-validation. Since we know the data-generating mechanism, we can evaluate the performance, and
the $K_1$-permanental model performs best.

Figure 2: Classification results. The contour plots of the probability that a new
point is assigned to class 1 (round dots) based on permanental models $K_1$ and
$K_2$ are shown in the left and middle panels. The boundary lines of classification
based on the $K_1$-permanental model (solid), neural network (dash), or support vector
machine (dot) are shown in the right panel.



Table 1: Error counts out of 90 and 3600 respectively

Classifier                       Training error   Testing error
Proposed model with $K_1$              0               308
Proposed model with $K_2$              5               301
Neural network                         0               334
Support vector machine                 0               357
Aggregate classification tree          0               391
$k$-nearest neighbor                   6               412



Given that the correct classification is determined by the chequerboard rule, the error rates for
training data and 60 x 60 grid points serving as testing data are summarized in Table 1. For
comparison purposes, some commonly used classifiers are listed in Table 1 too. In addition to the
neural network method and support vector machine, we also check the results based on an aggregated
classification tree with bagging number 100 and a $k$-nearest neighbor classifier with $k = 5$ chosen
by cross-validation. Diagonal linear discriminant analysis and logistic regression do not work for the
original $x_1$ and $x_2$ in this case, because the class regions are non-convex and interlaced.



5 Microarray Analysis: Leukemia Dataset 



The leukemia dataset described by Golub et al. (1999) uses microarray gene expression levels for cancer
classification. It consists of 72 tissue samples from two types of acute leukemia, 47 samples of type
ALL and 25 of type AML. The version used here, from the R package golubEsets downloaded
from http://bioconductor.org, contains expression levels for 7129 genes in each of 72 tissue
samples.
[Figure 3 panels: (a) Two-Dimensional Display, horizontal axis "mean difference"; (b) Prediction Error
on Average, horizontal axis "number of genes used"]

Figure 3: Leukemia Data. Left panel: a two-dimensional display of the dataset with
47 ALL (round dot), 25 AML (triangle), and a new observation (square). Right
panel: number of test errors on average over 200 learning/testing partitions based
on different methods, including support vector machine (point-down triangle),
diagonal linear discriminant analysis (point-up triangle), $k$-nearest neighbor (round
dot), permanental model with $K_2$ covariance (diamond), permanental model with
$K_1$ covariance (square, overlapped with diamond).



The left panel of Figure 3 shows a two-dimensional projection in which the $x$ axis is the straight
line joining the class centroids, and the $y$ axis is the first principal component. Unlike the usual
heatmap display such as Fig. 3B in Golub et al. (1999), each sample is plotted here as a single point.
The goal is to classify each new tissue sample as ALL or AML based on the gene expression levels.



The leukemia dataset has been widely used for testing classifiers. Dudoit et al. (2002) did a
comprehensive comparison of various discriminant methods using this dataset as well as two other
popular microarray datasets. Based on their study, the nearest neighbor classifier and the diagonal
linear discriminant analysis work the best when 40 selected genes are considered.

To compare the performance of the proposed method with other methods, we follow the training/testing
partitioning procedure used by Dudoit et al. (2002). The 72 samples are randomly divided into 48
training points and 24 testing points. Each classifier is fitted or trained using the 48 training
points and tested using the 24 testing points. The number of misclassified points out of 24 is
recorded. The procedure is repeated 200 times for each classifier. The number of test errors on average
is used to evaluate the performance of classifiers.
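
The partitioning procedure can be sketched as a small harness. The interface is hypothetical: `fit` is assumed to take training samples and labels and return a prediction function, and `majority_fit` is a deliberately trivial stand-in classifier used only to exercise the harness on placeholder data, not one of the methods compared here.

```python
import random
from collections import Counter


def average_test_errors(samples, labels, fit, n_train=48, n_repeats=200, seed=0):
    """Mean number of misclassified test points over random train/test partitions."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    total = 0
    for _ in range(n_repeats):
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        predict = fit([samples[i] for i in train], [labels[i] for i in train])
        total += sum(predict(samples[i]) != labels[i] for i in test)
    return total / n_repeats


def majority_fit(train_x, train_y):
    """Stand-in classifier: always predicts the most frequent training label."""
    winner = Counter(train_y).most_common(1)[0][0]
    return lambda x: winner


# Placeholder data with the same class sizes as the leukemia set (47 ALL, 25 AML).
samples = list(range(72))
labels = ["ALL" if s < 47 else "AML" for s in samples]
print(average_test_errors(samples, labels, majority_fit))
```

Any classifier that can be wrapped in the same `fit` interface can be passed in place of `majority_fit`.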

The right panel of Figure 3 shows the number of prediction errors on average over 200 random
training/testing partitions. The genes used for discriminant analysis are selected according to the
ratio of between-group variance to within-group variance (Dudoit et al. 2002, Section 3.4). The
proposed models with $K_1$ or $K_2$ are compared with the two winners in Dudoit et al. (2002), the
$k$-nearest neighbor and diagonal linear discriminant analysis methods, as well as the support vector
machine method, which became popular more recently. As the number of selected genes increases, the mean
number of test errors of the four classifiers follows a similar pattern. It decreases initially as more
information becomes available for parameter estimation, but subsequently increases as the signal
becomes lost in the noise. The proposed models with $K_1$ and $K_2$ perform as well as the support
vector machine, but better than the $k$-nearest neighbor and diagonal linear discriminant analysis
methods, in the sense of minimum average error count. Compared with the support vector machine, the
proposed model performs reasonably well even with bad selection of covariates. It seems more capable of
handling high-dimensional data. This is critical when the true classification relies on non-reducible
high-dimensional features. In terms of computational time, the proposed method is comparable with the
neural network and support vector machine methods with moderate data size, but slower than the diagonal
linear discriminant and $k$-nearest neighbor methods. As the number of feature variables increases, the
error rates increase for all classifiers, but more rapidly for the neural network and support vector
machine than for either permanental classifier.
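
The gene selection step can be sketched as follows. The code ranks genes by the ratio of between-group to within-group sums of squares, which is one common way to express the between-group to within-group variance criterion; the function names and the toy expression matrix are illustrative, and a gene with zero within-group variation would need separate handling.

```python
def bss_wss_ratio(values, labels):
    """Between-group over within-group sum of squares for one gene."""
    overall = sum(values) / len(values)
    bss = wss = 0.0
    for c in set(labels):
        group = [v for v, l in zip(values, labels) if l == c]
        mean = sum(group) / len(group)
        bss += len(group) * (mean - overall) ** 2
        wss += sum((v - mean) ** 2 for v in group)
    return bss / wss


def select_genes(expr, labels, p):
    """Indices of the p genes with the largest ratio; expr[g] holds gene g across samples."""
    scores = [bss_wss_ratio(row, labels) for row in expr]
    return sorted(range(len(expr)), key=lambda g: scores[g], reverse=True)[:p]


expr = [
    [1.0, 2.0, 1.0, 2.0],   # no separation between the two groups
    [0.0, 0.1, 5.0, 5.1],   # strong separation
    [3.0, 3.2, 2.9, 3.1],   # weak separation
]
labels = ["ALL", "ALL", "AML", "AML"]
print(select_genes(expr, labels, 2))  # [1, 2]
```

To avoid optimistic error estimates, the selection would normally be repeated inside each training partition rather than once on all 72 samples.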
Appendix

Proof of Theorem 3.2. We only need to prove case (iii), because case (i) corresponds to $|B| = n$,
$|b| = 1$, while case (ii) corresponds to $|B| = 1$, $|b| = n$.

First, if $K(t \cup x)$ is also block-diagonal, then
$R_n(t; x) = R_n^{(k)}(t; x) = c_b(\alpha + |b|)$ given $t \in b$, or $= \alpha K(t, t)$ given that $t$
does not belong to any block of $B$. Here $k = 1, 2, \ldots, n$. Therefore,
$R_{n-1}(x_i; x_{-i}) = R_{n-1}^{(k)}(x_i; x_{-i}) = c_b(\alpha + |b| - 1)$ given $i \in b$;
$R_{n-2}(x_j; x_{-i-j}) = R_{n-2}^{(k)}(x_j; x_{-i-j}) = c_b(\alpha + |b| - 2)$ given $i, j \in b$,
$i \neq j$; and so on.

The formula for $R_n(t; x)$ in case (iii) can be justified by applying mathematical induction on the
cyclic expansion of $R_n$. For its cyclic approximations,
$R_n^{(1)}(t; x) = \alpha K(t, t) + \sum_{b \in B} \sum_{i \in b} |K(t, x_i)|^2 / c_b \neq R_n(t; x)$
in general. It is straightforward to verify that $R_n^{(2)}(t; x) = R_n(t; x)$. The formula for
$R_n^{(k)}(t; x)$ with $k \geq 3$ can be justified using the equation below with index
$l = 1, 2, \ldots, k - 2$.

^it-i 3^21 , Xi2 ) • • • K[xi^_^ , t) 



^k-l'Ftl,---,tk-l-l 



^ — lK{t,Xi^)K{xi,,Xi2)---K{xif^_^,t) + 

^ — K{t,Xi^)K{xi^,Xi2)■■■K{xi^_^^^,t)\, ii,...,ikGb 

*fe-i + l7^«li---;*fc-I 



□ 